Fit of biokinetic data in molecular radiotherapy: a machine learning approach

Background In literature are reported different analytical methods (AM) to choose the proper fit model and to fit data of the time-activity curve (TAC). On the other hand, Machine Learning algorithms (ML) are increasingly used for both classification and regression tasks. The aim of this work was to investigate the possibility of employing ML both to classify the most appropriate fit model and to predict the area under the curve (τ). Methods Two different ML systems have been developed for classifying the fit model and to predict the biokinetic parameters. The two systems were trained and tested with synthetic TACs simulating a whole-body Fraction Injected Activity for patients affected by metastatic Differentiated Thyroid Carcinoma, administered with [131I]I-NaI. Test performances, defined as classification accuracy (CA) and percentage difference between the actual and the estimated area under the curve (Δτ), were compared with those obtained using AM varying the number of points (N) of the TACs. A comparison between AM and ML were performed using data of 20 real patients. Results As N varies, CA remains constant for ML (about 98%), while it improves for F-test (from 62 to 92%) and AICc (from 50 to 92%), as N increases. With AM, \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\Delta \tau$$\end{document}Δτ can reach down to − 67%, while using ML \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\Delta \tau$$\end{document}Δτ ranges within ± 25%. Using real TACs, there is a good agreement between τ obtained with ML system and AM. Conclusions The employing of ML systems may be feasible, having both a better classification and a better estimation of biokinetic parameters.

coefficient (TIAc).Many parameters affect the fit model selection: the number of TAC-Ps, the time window, the time sampling and the error that affects each measurement.
In 2007 Glatting et al. [4] proposed the use of the corrected Akaike Information Criterion (AICc) and F-test [5][6][7][8] methods for the comparison of two different models in molecular radiotherapy.In 2013 Kletting et al. [9] showed the need of using a fitting method dedicated to MRT data whereby the physical and biological aspects of the biokinetics curve must be considered in order to compute meaningful parameters.
In recent years, several applications based on Machine Learning (ML) algorithms have been developed in nuclear medicine [10][11][12].The first aim of this study was to implement ML systems to classify the proper curves model and predict, via regression, the TIAc.The secondary aim was to compare the ML performances with the ones obtained with fit algorithm and the analytical methods, AICc and F-test.

Methods
In order to have a better understanding of this manuscript we assume the following definitions: (i) Points of the Time-Activity Curve (TAC-Ps) are the set of points used to determine a TAC (ii) The training set (Tr N ) and the test set (Ts N ) consist of 10,000 and 2000 TAC-Ps, respectively, where each individual TAC-Ps is composed of a number of points equal to N.
The study was subdivided in 3 steps: (i) the generation of synthetic Tr N and Ts N to train and test the machine learning systems.(ii) The training of the ML systems, (iii) the test of the ML systems and performance evaluation in comparison with analytical methods.All these steps are fully described in the sections below and shown in Fig. 1.
At the end, in order to show the applicability of the developed methods, the ML systems were used to calculate the TIAc of 20 patients' TAC-Ps.

Training and test dataset generation
To train and test the ML systems a series of Tr N and Ts N were synthetically generated.The aim of this first step was to generate TAC-Ps simulating whole-body fractions of injected activity (FIA WB ) of patient affected by metastatic Differentiated Thyroid Carcinoma (mDTC), defined as: where A WB (t) is the whole-body activity at time t and A Adm is the administered activity.
The synthetic TAC-Ps were generated with both a Mono-Exponential (MEf ) and Bi-Exponential functions (BEf ) into a band of possible curves, showed in Fig. 2, and their values are listed in Table 1.In the "Appendix" the generation method is reported.
To study how N affects the ML performances, 10 different Tr N were created, with N ranging from 4 to 20, in a time window of 160 h.In order to simulate the patient's hospitalization, the first N-1 points were equally distributed in a range of [0; 54] h, while the last point was set at 160 h.Each Tr N was used to train the ML systems and it was derived from 5000 MEf and 5000 BEf curves.
Similarly, 10 Ts N were generated to test the ML systems; they were derived from 1000 MEf and 1000 BEf curves.The generation procedure is similar to the one of the (1) Tr N, with the addition of Gaussian noise on the FIA coordinates, in order to simulate measuring errors.The amount of added noise is randomly extracted from a Gaussian distribution centred on the FIA coordinates, with a standard deviation of 5% on the FIA values [13].

Implementation and training of ML system
Two different supervised learning systems were implemented using Python scripts: one for a binary classification task (ML1) and one for a regression task (ML2).Scikit-Learn library provided different machine learning models.
Hardware was composed of a personal computer having 10th generation Intel I7 CPU with 16 GB RAM.
ML1 is an ensemble of two Logistic Regression models, which used the Soft Voting method to predict the proper class [14].The TAC-Ps were implemented as features (2N features considering x and y coordinates), and it returns MEf or BEf class as output.
ML2 is an AdaBoost ensemble composed of 5 Gradient Boosting Regressors (GBRs), working sequentially.Each GBR was designed as a chain of 1000 Decision Tree Regressors [14,16].The ML2 task is to predict the TIAcs ( A i , i ), in order to calculate τ .In ML2, TAC-Ps and fit model (MEf or BEf ) were used as input features (2N + 1 features).Each Tr N with a specific number of points individually trains ML2.Therefore, even though the ML2 algorithms are always the same, ML2 consists of 10 ML2 N , depending on the number of points N. To simplify, the subscript N is omitted when referring to ML2 in the rest of this text.
The adopted models from Python Scikit-Learn library were customized with hyperparameters [17] reported in Tables 2 and 3, for the classifier and the regressor respectively.All the other ones that are not reported in tables were set with the default values proposed by the library.

Analytical methods: AICc and F-test
Each TAC-Ps of Ts N were fitted using the Trust-Region algorithm, implemented on the Matlab toolkit, while a stripping algorithm was used to evaluate the fit starting point [18].Each TAC-Ps was fitted both with MEf and BEf and the two results were compared using the AICc and F-Test (with a confidence interval of 95%), as mentioned by Kletting et al. [4].As reported in [7] the AICc is described by the following equation: where K is the number of estimated parameters included in the model, N is the number of points, and SS is the sum of squared deviations (between the measurement and the fitted curve).The model with the lower AICc score is the model that is more likely to be correct.Once AICc has been calculated for each fitting model, it is possible to compute the probability that the correct model (i) has been chosen, as follows: where is the difference between the AICc score.
The model with the minimal AICc value between all candidate models indicates the best model.
The F-test is a hypothesis test in which only two models can be compared using the following equation [4,7]: where SS are the sums of squared deviations (for MEf and BEf ) and DF are the degrees of freedom (DF = N-K).The decision to accept or discard the MEf model is based on the p-value calculated from the F ratio.For a P value below the chosen significance level (2)  (0.05) the MEf model is rejected and therefore the more complex model is assumed to fit the data in a significantly better way.

Test and performance evaluation
A tenfold cross validation of ML1 system has been performed using 80% of the Tr N in training and the remaining 20% for testing.The classification accuracy (CA) of the correct fit model was chosen as performance estimator.It is defined as: where CC ME and CC BE are the number of correct classifications for mono-and bi-expo- nential functions respectively, and 2000 is the number of TACs.
The ability of ML1 system, AICc and F-test to correctly classify the model was evaluated, determining the CA using Ts N as input.
The chosen parameter to evaluate the goodness of TIAc prediction was the area under each TAC ( τ ) , assessed through the following equation [15]: where n is equal to 1 for MEf and 2 for BEf and A i and i are the parameters obtained from the fit.
Two different tests were performed to evaluate the goodness of TIAc prediction: the first had the aim to evaluate the best possible performances of the fit algorithm and ML2, considering a classification with no errors.Meanwhile the second test evaluated the performances of the chains ML1 + ML2, fit + AICc and fit + F-Test, considering all the classification errors.In addition, a fourth chain was considered, named fit + AICc-W.In this method, an average model was considered, taking into account the probability of each model given by the parameter w i .The area τ was calculated as follows: The performances of the two tests were evaluated assessing the distributions of the percentage differences (�τ ) between the calculated τ and true one, the interquartile range and the Maximum Error Range (MER).The MER was represented as the range between the minimum and maximum value of �τ.

Test on patient data
The whole-body TAC-Ps of 20 patients, affected by mDTC, were evaluated employing both methods: Fit + F-Test and ML1 + ML2.All TAC data were obtained by calculating the geometric mean of measurements taken with a plastic scintillator placed at a distance of 4 m from the patient and calibrated in terms of H* (10), providing a value of the dose rate in µSv/h.Each TAC-Ps had 5 or 6 points and they were uniformly classified as 10 MEf and 10 BEf by the F-Test. (5

Results
The computation time to train ML1 and ML2 sequentially is about eight minutes, while the computer takes few seconds to perform the test with 2000 curves.The same hardware takes about 10 min to perform fit + F-test and fit + AICc using the same datasets.
The CA in the cross validation, varying the number of points (N), is reasonably constant at a value of 99%. Figure 3 shows the CA of the three systems using the test dataset that is about constant at the value of 98% for ML1, while it increases from 50 to 92% for AICc and from 62 to 92% for F-test, when the number of points increases.
Figure 4 shows the �τ distributions obtained using a classification without errors.The median values remain approximately constant around 0% for both methods, and also the Fig. 3 Classification accuracy obtained by three methods varying the number of points of the TACs simulated Fig. 4 Boxplots representing the Δτ distributions and the MER bands obtained using the fit algorithm (blue) and ML2 (red).For each set of measures the performances of the ML system and the fit algorithm were compared only for the assessment of the AUC, since it was known a priori if the points follow a mono-exponential or a bi-exponential trend.Through the estimation of the fit model parameters the AUC was calculated and the distributions were represented width of the distribution between the first and third interquartile is independent from the number of points with values equal to 4.5% and 3.6% for ML2 and the fit algorithm respectively.MER are represented by the bands in Fig. 4, which become thinner as the number of points increases, and their ranges vary from [− 14.2; 14.5]% to [− 8.1; 7.5]%, and from [− 16.8; 22.1]% to [− 6.7; 17.3]% for ML2 and fit algorithm respectively.
The �τ distributions shown in Fig. 5 were obtained including any classification fail- ure from AICc, AICc-W, F-test and ML1.Even in this case, varying N, the median values are approximately constant for all the considered methods, while interquartile range varies from 24.7 to 3.1% for the fit + AICc, from 24.6 to 3.1% for the fit + AICc-W and from 15.8% to 3.1% for the fit + F-test.The width remains constant around 4.5% for the ML1 + ML2 system.With the increase of the number of points, MER varies from [− 66.When N < 8, �τ distributions obtained by AICc and F-Test are asymmetrical, while for N > 8 the distribution width, obtained with the three methods, are equally distributed.The MER band of ML2 is always thinner than the other two.
Considering the TAC-Ps of 20 patients, 9 out of the 10 MEfs were classified as BEfs, while all the 10 presented BEfs were confirmed to be BEfs by ML1. Figure 6 shows the correlation between the τ calculated with ML1 + ML2 and the one calculated with the fit Algorithm + F-Test on the 20 patients data, obtaining an R 2 equal to 0.93, and a difference percentage of the results in a [− 26.9;12.8]%range, with a median value of 0.2%.

Discussion
The obtained results show the feasibility of using machine learning system both to classify the proper model and predict the TIAc with an improvement of the performances.
Fig. 5 Boxplots representing the Δτ distributions and the MER bands obtained using the fit + AICc (blue), fit + AICc-W (green), fit + F-test (grey) and ML1 + ML2 (red).For the analytical methods, points were fitted with both models and the AICc and F-Test determine which AUC should be considered.For ML, first the best fit model was predicted and then the model parameters in order to calculate the AUC The training phase plays a crucial role into obtaining results close to true values, and one of the major advantages shown by ML is the possibility for the model to be trained with error-free data and then tested with data simulating real curves, resulting in a high CA (about 98.5%) and a �τ less than ± 25%.This aspect can be par- ticularly advantageous in order to train the system without knowing the error of the measurement system "a priori".Training with error-free data is not only advantageous but also recommended, as reported by Geron et al. [14].The training phase should be conducted with the cleanest possible data, to avoid confusing the system and to prevent overfitting.To confirm this, training was performed with noisy data similar to the ones used in Tr N , and a decrease in classification accuracy of around 5% was recorded.
The time spent for training is very short, and it needs to be executed just once before running classifications and regressions.
The main difference between ML and the analytical method is the workflow.The former predicts the model and then assesses the biokinetic parameters.Instead, the latter calculates the biokinetic parameters for both models first and then chooses the optimal one.This aspect is crucial: the analytical method is not optimized because it necessarily performs both fit models and, in addition, the choice of the model is strongly dependent on the algorithm fit and on the goodness of the TIAc.
The cross-validation results are better than the ones obtained in the test phase, but this was to be expected: in fact, Ts N simulates the measurements errors (thanks to the addition of Gaussian noise), and so the results get worse without being associated with an overfitting condition.Figure 3 shows that CA is not dependent on the number of points for ML1, while it is highly dependent for AICc and F-test.The analytical methods show a low accuracy when the number of points is near to the limit of applicability (from Eqs. 3 and 4 it is possible to notice that F-test and AICc are available only if the number of points is at least 5 and 6 respectively).As reported by Kletting et al. in [19], F-test and AICc classify as mono-exponential most of the bi-exponential curves (CA = 61.0%and CA = 50.0%)if the number of points is 5 or 6 respectively.In addition, the CA obtained with ML1, equal to 98.5%, is higher than the results obtained with AICc and F-test.
Figure 4 shows the best results obtainable through the two systems, dedicated to the assessment of biokinetic parameters.Considering an error-free classification, the width of the �τ distributions, obtained with ML2 and with the fit algorithm, are sim- ilar: hence, in most cases the two systems are equivalent.The MER obtained from ML2, represented by the red band in Fig. 4, does not exceed ± 15% and it is smaller than the one obtained using the fit algorithm (about ± 20% and represented by the blue band).
Glatting et al. in [4] use fitting algorithms that take into account uncertainties on individual points of the TAC.The fits used in the present study, on the other hand, do not consider such errors and, in future works, it will be necessary to perform a comparison.The use of these algorithms will likely lead to a convergence of performance between analytical methods and ML methods, but a priori knowledge of errors is not always possible.Therefore, "a priori"-trained ML systems remain a more general method for both model selection and obtaining kinetic parameters.
When the input to determine the biokinetic parameters are the classification results obtained through ML1, AICc, AICc-W and F-test, performances get worse for all the examined four systems (Fig. 5), but the ML1 + ML2 system keeps MER within ± 25%.The analytical methods tend to underestimate the area under the curve when the number of points is lower than 8.This effect is due to the tendency of the two methods to classify bi-exponential curves as mono-exponential.
The use of ML in this field can also lead to the possibility of using it in a more radical way.For example, an attempt was made to train the ML2 system to obtain the area under the curve directly as the output, instead of the fit parameters.The obtained results were similar to those shown in Figs. 4 and 5, but, according to the authors, this is an incorrect way of using ML systems: in fact they are already considered black boxes, and using the system in the aforementioned way, without the possibility to assure the goodness of the fit, can be challenging to justify in the clinical use of dosimetry.
It is important to underline that the performance of ML systems has proven to be fairly independent from the number of points composing the TACs, whereas a more pronounced dependence has been observed in relation to errors in the data.Some tests were conducted by varying the error on the points, resulting in a decrease in performance.These analyses were not reported in the results section because they go beyond the scope of this work, which is to demonstrate the feasibility of using ML systems in dosimetry in molecular radiotherapy, for model selection and the calculation of the area under the TAC curve.
Figure 6 shows the applicability of the ML systems on real data: it confirms that, when ML1 and F-test classify the curve with the same model, the assessed biokinetic parameters are very similar.On the other hand, when classification does not coincide, the distance between the two results increases.An example of these results is shown in Fig. 7. Considering CA for 5 points (shown in Fig. 3) and the graph in Fig. 5, there is a high probability that the correct classification is given by ML1, and consequently the ML2 result could better represent the biokinetic curve of the patient.
ML1 and ML2 have been trained with curves simulating the whole-body FIA of patients administered with [ 131 I]-NaI, but can be trained to perform cumulated activity calculations also for different radionuclides in metabolic radiotherapy, and with different dosimetry calculation methods (i.e.voxel dosimetry).
In 2017 Sarrut et al. [20] showed that the selection of the proper fit model reduces the number of fit failures in voxel dosimetry (defined as the case in which the optimizer does not converge and reaches the maximum number of iterations, or if the R 2 is lower than a certain threshold).The brief time to perform the 2000 test fits and the independence of the performance from the number of points make these systems suitable for this application.
It is necessary to emphasize that each new application (such as the study of new curve models) needs a new training, and that the performance is strongly dependent on the similarity between the training samples and the clinical case.In addition, in this study, Gaussian-type errors were used on the data because the purpose of the work was to demonstrate the feasibility of using ML systems.However, in future studies, it will be necessary to investigate how data errors affect performance using different thresholds and various types of distributions, such as the Poisson distribution.This is a study on the feasibility of using the ML system for classifying the correct fit model and predicting the TIAc, and it doesn't have the purpose to study which ML algorithm is the optimal one in performing these two tasks.The choice of the logistic regression and the AdaBoost algorithm is arbitrary, and a subsequent study is necessary to identify which algorithms could increase the performance results.
In conclusion, to the knowledge of the authors, this study is the first to propose the use of ML systems for TIAc calculations for dosimetry in MRT.As demonstrated, the use is feasible and promising, but it requires further investigations in different fields.Therefore, investigations will be performed in order to train such systems with different algorithms, treatments, radionuclides, curve models and for voxel dosimetry.

Method for synthetic TAC-Ps generation
The datasets (TR N and TS N ) used to train and test the systems consist of TAC-Ps, where each point composing them simulates the FIA WB of patients with metastatic differentiated thyroid carcinoma (mDTC), administered with [ 131 I]I-NaI.Each point in the TAC-Ps simulates the result of the geometric mean between measurements acquired from the anterior and posterior positions, using an external probe located 4 m from the patient.The TIAc values for the two bands are reported in Table 1 and are obtained using the upper and lower bounds of the TACs calculated for 50 patients administered in our institution and subjected to dosimetric study.
To generate each of the TAC-Ps, the following method was used for both mono-exponential and bi-exponential trends.For BEf, four time points were selected at 0, 24, 90, and 160 h and, for each of these times, a range of possible values was determined by the band showed in Fig. 2. From each of these ranges, a random FIA WB value was extracted, resulting in 4 values corresponding to the 4 different times.These values were fitted with a bi-exponential curve, obtaining the parameters of the generated curve.Finally, a check is performed to ensure that the generated curve does not intersect the bands up to 500 h.If the check is successful, the curve parameters are recorded, and the TAC-Ps are created with the chosen number of points for either Tr N or Ts N .The procedure for TAC-Ps with a mono-exponential trend is identical, with the only difference being that two points are extracted at times 0 and 90 h.

Fig. 6
Fig. 6 Comparison between the τ values obtained with ML1 + ML2 and Fit Algorithm + F-test.The red points represent the area under the curve of TACs classified as mono-exponential for the F-test while the blue points represent the area under the curve of TACs classified as bi-exponential for the F-test

Fig. 7
Fig. 7 TACs of Wholebody FIA of three different patients and fit obtained with fit + F-test and ML1 + ML2 systems.a The two model agree classifying and fitting the data with a MEf.b Also in this case ML1 and F-test classify with a BEf.c In the third case the F-test classify the curve as MEf while it is a BEf for ML2.In this case the two obtained fit curves are significantly different

Table 2
Scikit-Learn hyperparameters used for the customization of ML1

Table 3
Scikit-Learn hyperparameters used for the customization of ML2